Unary TEI Elements and the Token Based Corpus

Authors

  • Thomas Krause
  • Carolin Odebrecht
  • Amir Zeldes
  • Florian Zipser
Abstract

The establishment of TEI as a standard for textual data produced outside the narrow domain of corpus linguistics, in history, literature, philosophy and other fields, has led to a fruitful integration of encoding vocabulary from different fields of interest. This has come at the cost of a large stock of elements, heterogeneous interpretations of those elements, and limitations on the kinds of annotation combinations that a schema allows. Meanwhile, in corpus and computational linguistics circles, generic, vocabulary-agnostic, graph-based models of corpus representation have gained prominence (notable examples are PAULA, Dipper 2005, and GrAF, Ide & Suderman 2007, the latter recently canonized as part of the LAF standard in ISO 24612). Graph-based annotation formats lend themselves to generic, reusable query architectures, but reduce all data to the same ontological status. Specifically, corpora in corpus linguistics center on the concept of tokens, the minimal technical units of linguistic analysis, which serve as textual anchors for higher annotations (either features of the tokens, such as parts of speech, or higher structures, such as syntax trees). In this paper we point out a specific subset of problems caused by this dissonance between the TEI model and the token-based corpus annotation graph. We focus on the interpretation of unary XML elements, such as line or page breaks (e.g. <lb/>, <pb/>), and the representation of the underlying data structure in non-XML-based corpus query systems. Unary elements present a particular challenge for a token-based corpus, since they occur within the plain text of a TEI document, yet they cover no part of the text, as shown in Figure 1.
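The difficulty described in the abstract can be illustrated with a small sketch. It is not the authors' method: the TEI fragment, the whitespace tokenization, and the function name `tokens_and_milestones` are all assumptions made for illustration. The sketch parses a flat paragraph and shows that the unary elements contribute no tokens, so the only thing that can be recorded about them in a token-based model is a position between tokens.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment (not taken from the paper): a flat paragraph
# containing a line break and a page break.
TEI_SNIPPET = "<p>the quick<lb/>brown fox<pb/>jumps</p>"


def tokens_and_milestones(xml_text):
    """Return (tokens, milestones) for a flat, non-nested paragraph.

    tokens     -- whitespace-separated words (an assumed tokenization)
    milestones -- (tag, token_index) pairs; token_index is the index of
                  the token that follows the unary element, i.e. the
                  element sits in the gap before that token
    """
    root = ET.fromstring(xml_text)
    tokens = []
    milestones = []

    def add_text(text):
        # Text chunks may be None when nothing precedes or follows a tag.
        if text:
            tokens.extend(text.split())

    add_text(root.text)
    for child in root:
        if child.text is None and len(child) == 0:
            # Unary element: it covers no text, so it can only be anchored
            # to a zero-width position between two tokens.
            milestones.append((child.tag, len(tokens)))
        else:
            # Non-empty child: keep its direct text (nesting is ignored here).
            add_text(child.text)
        add_text(child.tail)
    return tokens, milestones


if __name__ == "__main__":
    toks, marks = tokens_and_milestones(TEI_SNIPPET)
    print(toks)
    print(marks)
```

Recording a milestone as an index into the token list is only one possible design. It breaks down when a break falls inside a word, because whitespace tokenization applied to each text chunk separately would then split that word into two tokens.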

Similar resources

TEITOK: Text-Faithful Annotated Corpora

TEITOK is a web-based framework for corpus creation, annotation, and distribution, that combines textual and linguistic annotation within a single TEI based XML document. TEITOK provides several built-in NLP tools to automatically (pre)process texts, and is highly customizable. It features multiple orthographic transcription layers, and a wide range of user-defined token-based annotations. For ...

The Impact of Teaching Corpus-based Collocation on EFL Learners' Writing Ability

Abstract The present study explores the impact of corpus-based collocation instruction on intermediate Iranian EFL learners' writing ability. For this study, 84 Iranian learners, studying English as a foreign language in Bayan Institute, Iran, were selected and were randomly divided into two groups, experimental and control. Conventional methods of writing instruction were taught to the control...

Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research

This paper poses the question, how linguistic corpus-based research may be enriched by the exploitation of conceptual text structures and layout as provided via TEI annotation. Examples for possible areas of research and usage scenarios are provided based on the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged accordant to the TEI Guidelines...

An Improved Token-Based and Starvation Free Distributed Mutual Exclusion Algorithm

Distributed mutual exclusion is a fundamental problem of distributed systems that coordinates the access to critical shared resources. It concerns with how the various distributed processes access to the shared resources in a mutually exclusive manner. This paper presents fully distributed improved token based mutual exclusion algorithm for distributed system. In this algorithm, a process which...

Publication date: 2013